Team Name: Sidharth Kumar Mohanty
The objective of this project is to find the best neighbourhood in Toronto (a city in Canada) to open a startup or an Italian restaurant, using Foursquare location data. In this project we'll work through a solution that favours low-risk locations with a high chance of success.
For this project we need the following tools and techniques:
Machine Learning, Web Scraping, Foursquare API, Geocoder, Beautiful Soup, Folium
# install geopy for geocoding (converting addresses to coordinates)
!pip install geopy
# install beautifulsoup4 for web scraping
!pip install beautifulsoup4
# install requests to fetch data from a URL
!pip install requests
# install scikit-learn for k-means clustering
!pip install scikit-learn
# install folium for visualization
!pip install folium
# import all necessary libraries
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import json # library to handle JSON files
# !conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
from bs4 import BeautifulSoup
import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt
# import k-means from clustering stage
from sklearn.cluster import KMeans
# !conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library
print('Libraries imported.')
As a ready-made dataset is not available, we will build a dataset of all Toronto neighborhoods by web scraping.
# Get the neighborhood data using beautiful soup
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
result = requests.get(url)
# parse the page content with BeautifulSoup
soup = BeautifulSoup(result.content, 'html.parser')
# loop through table, grab each of the 3 columns shown
# Scrape the neighborhood data from the table in the wikipedia page of Toronto
table_contents=[]
table=soup.find('table')
for row in table.findAll('td'):
    cell = {}
    if row.span.text == 'Not assigned':
        pass
    else:
        # create three columns: "PostalCode", "Borough" and "Neighborhood"
        cell['PostalCode'] = row.p.text[:3]  # keep only the first three characters of the <p> text (e.g. M3A)
        cell['Borough'] = (row.span.text).split('(')[0]
        # strip symbols like "(", ")" and "/" from the neighborhood name (e.g. "(Parkview Hill / Woodbine Gardens)")
        cell['Neighborhood'] = (((((row.span.text).split('(')[1]).strip(')')).replace(' /', ',')).replace(')', ' ')).strip(' ')
        table_contents.append(cell)
df=pd.DataFrame(table_contents)
# shorten some long borough names
df['Borough'] = df['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade': 'Downtown Toronto Stn A',
                                       'East TorontoBusiness reply mail Processing Centre969 Eastern': 'East Toronto Business',
                                       'EtobicokeNorthwest': 'Etobicoke Northwest',
                                       'East YorkEast Toronto': 'East York/East Toronto',
                                       'MississaugaCanada Post Gateway Processing Centre': 'Mississauga'})
df.head()
This is the dataset we're going to use. It has 3 columns: "PostalCode", "Borough" and "Neighborhood". As the dataset is unstructured and dirty, we need some pre-processing to clean it.
# save this dataframe in a CSV file
df.to_csv('Neighborhood Data.csv')
In this step we'll clean the dataframe by dropping rows that have null values or are marked as "Not assigned".
# drop rows having null value and value assigned as "Not assigned"
df_dropna = df.dropna()
empty = 'Not assigned'
df_dropna = df_dropna[(df_dropna.PostalCode != empty) & (df_dropna.Borough != empty) & (df_dropna.Neighborhood != empty)].reset_index(drop=True)
# check for missing value
df_dropna.isnull().sum()
# Check if we still have any Neighborhoods that are Not Assigned
df_dropna.loc[df_dropna['Borough'].isin(["Not assigned"])]
df = df_dropna
df.head()
# shape of dataframe
df.shape
Now the data is cleaned and all the requirements are met, so we just have to add the latitude and longitude of each location.
Now that we have built a dataframe with the postal code, borough and neighborhood name of each neighborhood, we need the latitude and longitude coordinates of each neighborhood in order to utilize the Foursquare location data. We are going to create a new table with the latitudes and longitudes corresponding to the different postal codes.
# get the latitude and the longitude coordinates of each Postal code
geo_url = "https://cocl.us/Geospatial_data"
geo_df = pd.read_csv(geo_url)
geo_df.rename(columns={'Postal Code': 'PostalCode'}, inplace=True)
geo_df.head()
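As an aside, the geopy package installed earlier could geocode a postal code directly instead of relying on the CSV; a minimal sketch (the user_agent string is an assumption, and Nominatim results for Canadian postal codes can be unreliable, which is why the CSV file is used here):
# optional: geocode a single postal code with geopy's Nominatim (illustrative only)
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="toronto_neighborhood_explorer")
location = geolocator.geocode("M5A, Toronto, Ontario")
if location is not None:
    print(location.latitude, location.longitude)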
Now we'll merge the geographical dataframe with the neighborhood dataframe on PostalCode.
# Merging the Data
df = pd.merge(df, geo_df, on='PostalCode')
df.head()
# let's find out how many neighborhoods are present in each borough
df.groupby('Borough').count()['Neighborhood']
df_toronto = df
df_toronto.head()
# Create a list and store all unique borough names
boroughs = df_toronto['Borough'].unique().tolist()
# Obtain the Latitude and Longitude of Toronto by taking mean of Latitude/Longitude of all postal code
lat_toronto = df_toronto['Latitude'].mean()
lon_toronto = df_toronto['Longitude'].mean()
print('The geographical coordinates of Toronto are {}, {}'.format(lat_toronto, lon_toronto))
# assign a random color to each borough
borough_color = {}
for borough in boroughs:
    borough_color[borough] = '#%02X%02X%02X' % tuple(np.random.choice(range(256), size=3))  # random hex color
map_toronto = folium.Map(location=[lat_toronto, lon_toronto], zoom_start=10.5)
# add markers to map
for lat, lng, borough, neighborhood in zip(df_toronto['Latitude'],
                                           df_toronto['Longitude'],
                                           df_toronto['Borough'],
                                           df_toronto['Neighborhood']):
    label_text = borough + ' - ' + neighborhood
    label = folium.Popup(label_text)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=borough_color[borough],
        fill=True,
        fill_color=borough_color[borough],
        fill_opacity=0.8).add_to(map_toronto)
map_toronto
CLIENT_ID = 'CURLH5YYCXMLJUABNE5Y22LK1JNKWHZLO5MCW2OD4PRRRDK1' # your Foursquare ID
CLIENT_SECRET = 'O5PCL405KIK4MGGBIMJD2EIAYSEIQK03W4QMEG4L4ZYOEMMF' # your Foursquare Secret
VERSION = 20200514 # Foursquare API version
print('Credentials Stored')
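As a side note, hard-coding credentials in a notebook is risky; a minimal sketch of a safer pattern using environment variables (the variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are assumptions):
# optional: read Foursquare credentials from environment variables instead of hard-coding them
import os
CLIENT_ID = os.environ.get('FOURSQUARE_CLIENT_ID', CLIENT_ID)
CLIENT_SECRET = os.environ.get('FOURSQUARE_CLIENT_SECRET', CLIENT_SECRET)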
First, let's define a function that creates the GET request URL and collects the nearby venues for each neighborhood.
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    LIMIT = 100  # limit on the number of venues returned by the Foursquare API
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        # keep only the relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                             'Neighborhood Latitude',
                             'Neighborhood Longitude',
                             'Venue',
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Category']
    return nearby_venues
#Get venues for all neighborhoods in our dataset
toronto_venues = getNearbyVenues(names=df_toronto['Neighborhood'],
                                 latitudes=df_toronto['Latitude'],
                                 longitudes=df_toronto['Longitude'])
toronto_venues.tail()
Let's check how many venues there are per neighborhood.
toronto_venues.groupby('Neighborhood').count()
print('There are {} unique venue categories.'.format(len(toronto_venues['Venue Category'].unique())))
print("The Unique Venue Categories are", toronto_venues['Venue Category'].unique())
"Italian Restaurant" in toronto_venues['Venue Category'].unique()
The column "Venue Category" contains categorical values, so we need to convert them to numerical values using one-hot encoding.
# one hot encoding
to_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")
# add neighborhood column back to dataframe
to_onehot['Neighborhoods'] = toronto_venues['Neighborhood']
# move neighborhood column to the first column
fixed_columns = [to_onehot.columns[-1]] + list(to_onehot.columns[:-1])
to_onehot = to_onehot[fixed_columns]
print("shape of dataset after one hot encoding is : ",to_onehot.shape)
to_onehot.head()
Next, let's group the rows by neighborhood, taking the mean of the frequency of occurrence of each category.
to_grouped = to_onehot.groupby(["Neighborhoods"]).mean().reset_index()
print(to_grouped.shape)
to_grouped.head()
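As a quick sanity check on these mean frequencies, one could inspect the top categories for a single neighborhood; a small sketch assuming to_grouped from above:
# spot check: top 5 venue categories (by mean frequency) for the first neighborhood
row = to_grouped.iloc[0]
print(row['Neighborhoods'])
print(row.drop('Neighborhoods').sort_values(ascending=False).head())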
For the clustering we only need the "Neighborhoods" and "Italian Restaurant" columns, so we'll select those two.
ita = to_grouped[["Neighborhoods","Italian Restaurant"]]
ita.head()
# rename column "Neighborhoods" to "Neighborhood"
ita = ita.rename(columns={'Neighborhoods':'Neighborhood'})
We will use k-means clustering. But first we will find the best K value using the Elbow Point method.
# drop "Neighborhood" column from the dataframe
X = ita.drop(['Neighborhood'], axis=1)
# find 'k' value by Elbow Method
plt.figure(figsize=[10, 8])
inertia = []
range_val = range(2, 20)
for i in range_val:
    kmean = KMeans(n_clusters=i)
    kmean.fit(X)
    inertia.append(kmean.inertia_)
plt.plot(range_val, inertia, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Inertia')
plt.title('The Elbow Method using Inertia')
plt.show()
Here we can see that the optimum K value is 4, so we will end up with 4 clusters.
kclusters = 4
toronto_grouped_clustering = ita.drop('Neighborhood', axis=1)
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)
# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]
# unique value in target column
np.unique(kmeans.labels_)
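As a cross-check on the elbow choice of K=4, the silhouette score from scikit-learn could also be computed over the candidate K values; a minimal sketch assuming X is the one-column restaurant-frequency dataframe from above:
# optional: silhouette score for each candidate number of clusters (higher is better)
from sklearn.metrics import silhouette_score
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))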
Now create a new dataframe that includes the cluster label as well as the venue data for each neighborhood.
to_merged = ita.copy()
# add clustering labels
to_merged["Cluster Labels"] = kmeans.labels_
to_merged.head()
# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
to_merged = to_merged.join(toronto_venues.set_index("Neighborhood"), on="Neighborhood")
print(to_merged.shape)
to_merged.head()
# sort the results by Cluster Labels
print(to_merged.shape)
to_merged.sort_values(["Cluster Labels"], inplace=True)
to_merged.tail()
Let's check how many Italian Restaurants there are.
to_merged['Venue Category'].value_counts()['Italian Restaurant']
We see that there are a total of 46 locations with Italian Restaurants in Toronto.
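To see where those venues are concentrated, we could also count the Italian Restaurants per neighborhood; a small sketch assuming toronto_venues from above:
# count Italian Restaurants per neighborhood
italian = toronto_venues[toronto_venues['Venue Category'] == 'Italian Restaurant']
print(italian['Neighborhood'].value_counts().head(10))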
Next, let's visualize the clusters on a map.
# create map
map_clusters = folium.Map(location=[lat_toronto, lon_toronto], zoom_start=11)
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(to_merged['Neighborhood Latitude'], to_merged['Neighborhood Longitude'], to_merged['Neighborhood'], to_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster))
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.8).add_to(map_clusters)
map_clusters
Running the above cell renders an interactive Folium map. However, GitHub does not render interactive map visualizations in notebooks, so the map will not show when this notebook is uploaded to GitHub. I've therefore uploaded an image of the map visualization from my drive in the next cell.
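One workaround: a Folium map can be saved as a standalone HTML file and opened in any browser; a minimal sketch (the file name is an assumption):
# save the interactive map as a standalone HTML file
map_clusters.save('map_clusters.html')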
ita["Cluster Labels"] = kmeans.labels_
ita.head()
objects = (1,2,3,4)
y_pos = np.arange(len(objects))
performance = ita['Cluster Labels'].value_counts().to_frame().sort_index(ascending=True)
perf = performance['Cluster Labels'].tolist()
plt.bar(y_pos, perf, align='center', alpha=0.8, color=['red', 'purple','aquamarine', 'darkkhaki'])
plt.xticks(y_pos, objects)
plt.ylabel('No of Neighborhoods')
plt.xlabel('Cluster')
plt.title('How many Neighborhoods per Cluster')
plt.show()
# How many neighborhoods in each cluster
ita['Cluster Labels'].value_counts()
# create a dataframe with the borough of each neighborhood, which we will merge with each cluster dataframe
df_new = df[['Borough', 'Neighborhood']]
df_new.head()
# Red
cluster1 = to_merged.loc[to_merged['Cluster Labels'] == 0]
df_cluster1 = pd.merge(df_new, cluster1, on='Neighborhood')
df_cluster1.head()
# Purple
cluster2=to_merged.loc[to_merged['Cluster Labels'] == 1]
df_cluster2 = pd.merge(df_new, cluster2, on='Neighborhood')
df_cluster2.head()
# Blue
cluster3 = to_merged.loc[to_merged['Cluster Labels'] == 2]
df_cluster3 = pd.merge(df_new, cluster3, on='Neighborhood')
df_cluster3.head()
# Turquoise
cluster4 = to_merged.loc[to_merged['Cluster Labels'] == 3]
df_cluster4 = pd.merge(df_new, cluster4, on='Neighborhood')
df_cluster4.head()
plt.figure(figsize=(15,5))
# Plot-1 ( Number of Neighborhoods per Cluster )
plt.subplot(1,2,1)
objects = (1,2,3,4)
y_pos = np.arange(len(objects))
performance = ita['Cluster Labels'].value_counts().to_frame().sort_index(ascending=True)
perf_1 = performance['Cluster Labels'].tolist()
plt.bar(y_pos, perf_1, align='center', alpha=0.8, color=['red', 'purple','aquamarine', 'darkkhaki'])
plt.xticks(y_pos, objects)
plt.ylabel('No of Neighborhoods')
plt.xlabel('Cluster')
plt.title('Number of Neighborhoods per Cluster')
# Plot-2 ( Average number of Italian Restaurants per Cluster )
plt.subplot(1, 2, 2)
clusters_mean = [df_cluster1['Italian Restaurant'].mean(), df_cluster2['Italian Restaurant'].mean(),
                 df_cluster3['Italian Restaurant'].mean(), df_cluster4['Italian Restaurant'].mean()]
y_pos = np.arange(len(objects))
perf_2 = clusters_mean
plt.bar(y_pos, perf_2, align='center', alpha=0.8, color=['red', 'purple','aquamarine', 'darkkhaki'])
plt.xticks(y_pos, objects)
plt.ylabel('Mean')
plt.xlabel('Cluster')
plt.title('Average number of Italian Restaurants per Cluster')
The neighborhoods located in the East Toronto area (cluster 3) have the highest average number of Italian Restaurants, represented by the aquamarine colour. North York has the second highest number of Italian Restaurants. Looking at the nearby venues, the optimum place to open a new Italian Restaurant is Victoria Village, North York (cluster 1): there are many neighborhoods in that area but only a small number of Italian Restaurants, which eliminates most of the competition. The second-best opportunity would be areas such as Queen's Park in cluster 4: having 70 neighborhoods in the area with no Italian Restaurants gives a good opening for a new restaurant. This concludes the findings of this project, which recommends that the entrepreneur open an authentic Italian restaurant in these locations with little to no competition. Nonetheless, if the food is authentic, affordable and tastes good, I am confident it will gain a great following anywhere.
Here we took an Italian Restaurant as an example; the same process can be followed to find the best place or neighborhood for any other type of business.